System and method for intelligent initiation of a man-machine dialogue based on multi-modal sensory inputs

ABSTRACT

The present teaching relates to methods, systems, media, and implementations for enabling communication with a user. Information representing the surroundings of a user to be engaged in a new dialogue is received via a communication platform, wherein the information is acquired from a scene in which the user is present and captures characteristics of the user and the scene. Relevant features are extracted from the information. A state of the user is estimated based on the relevant features, and a dialogue context surrounding the scene is determined based on the relevant features. A topic for the new dialogue is determined based on the user, and feedback is generated to initiate the new dialogue with the user based on the topic, the state of the user, and the dialogue context.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a Continuation of U.S. application Ser. No. 16/233,678, filed on Dec. 27, 2018, which claims priority to U.S. Provisional Application No. 62/612,163, filed Dec. 29, 2017, the contents of which are hereby incorporated by reference in their entireties.

The present application is related to International Application No. PCT/US2018/067649, filed Dec. 27, 2018, U.S. patent application Ser. No. 16/233,716, filed Dec. 27, 2018, and International Application No. PCT/US2018/067654, filed Dec. 27, 2018, which are hereby incorporated by reference in their entireties.

BACKGROUND

1. Technical Field

The present teaching generally relates to human machine communication. More specifically, the present teaching relates to adaptive human machine communication.

2. Technical Background

With the advancement of artificial intelligence technologies and the explosion in Internet based communications enabled by the Internet's ubiquitous connectivity, computer aided dialogue systems have become increasingly popular. For example, more and more call centers deploy automated dialogue robots to handle customer calls. Hotels have started to install various kiosks that can answer questions from tourists or guests. Online bookings (whether for travel accommodations or theater tickets, etc.) are also more frequently handled by chatbots. In recent years, automated human machine communications in other areas have also become increasingly popular.

Such traditional computer aided dialogue systems are usually pre-programmed with certain questions and answers based on commonly known patterns of conversations in different domains. Unfortunately, a human conversant can be unpredictable and sometimes does not follow a pre-planned dialogue pattern. In addition, in certain situations, a human conversant may digress during the process, and continuing a fixed conversation pattern will likely cause irritation or loss of interest. When this happens, such traditional machine dialogue systems often will not be able to continue to engage the human conversant, so that the human machine dialogue either has to be aborted to hand the task to a human operator or the human conversant simply leaves the dialogue, which is undesirable.

In addition, traditional machine based dialogue systems are often not designed to address the emotional factor of a human, let alone take into consideration how to address such an emotional factor when conversing with a human. For example, a traditional machine dialogue system usually does not initiate a conversation unless a human activates the system or asks some questions. Even if a traditional dialogue system does initiate a conversation, it has a fixed way to start the conversation and does not change from human to human or adjust based on observations. As such, although such systems are programmed to faithfully follow the pre-designed dialogue pattern, they are usually not able to act on the dynamics of the conversation and adapt in order to keep the conversation going in a way that engages the human. In many situations, when a human involved in a dialogue is clearly annoyed or frustrated, a traditional machine dialogue system is completely unaware and continues the conversation in the same manner that has annoyed the human. This not only makes the conversation end unpleasantly (of which the machine remains unaware) but also turns the person away from conversing with any machine based dialogue system in the future.

In some applications, conducting a human machine dialogue session based on what is observed from the human is crucially important in order to determine how to proceed effectively. One example is an education related dialogue. When a chatbot is used for teaching a child to read, whether the child is receptive to the way he/she is being taught has to be monitored and addressed continuously in order to be effective. Another limitation of traditional dialogue systems is their lack of context awareness. For example, a traditional dialogue system is not equipped with the ability to observe the context of a conversation and improvise its dialogue strategy in order to engage a user and improve the user experience.

Thus, there is a need for methods and systems that address such limitations.

SUMMARY

The teachings disclosed herein relate to methods, systems, and programming for human machine communication. More specifically, the present teaching relates to adaptive human machine communication.

In one example, there is provided a method, implemented on a machine having at least one processor, storage, and a communication platform capable of connecting to a network, for enabling communication with a user. Information representing the surroundings of a user to be engaged in a new dialogue is received via the communication platform, wherein the information is acquired from a scene in which the user is present and captures characteristics of the user and the scene. Relevant features are extracted from the information. A state of the user is estimated based on the relevant features, and a dialogue context surrounding the scene is determined based on the relevant features. A topic for the new dialogue is determined based on the user, and feedback is generated to initiate the new dialogue with the user based on the topic, the state of the user, and the dialogue context.

In a different example, there is provided a system for enabling communication with a user. The system includes a multimodal data analysis unit configured for receiving information representing the surroundings of a user to be engaged in a new dialogue, wherein the information is acquired from a scene in which the user is present and captures characteristics of the user and the scene, and for extracting relevant features from the information. The system includes a user state estimator configured for estimating a state of the user based on the relevant features, and a dialogue contextual info determiner configured for determining a dialogue context surrounding the scene based on the relevant features. The system also includes a dialogue controller configured for determining a topic for the new dialogue based on the user and generating feedback to initiate the new dialogue with the user based on the topic, the state of the user, and the dialogue context.

Other concepts relate to software for implementing the present teaching. A software product, in accord with this concept, includes at least one machine-readable non-transitory medium and information carried by the medium. The information carried by the medium may be executable program code, data or parameters in association with the executable program code, and/or information related to a user, a request, content, or other additional information.

In one example, there is provided a machine readable and non-transitory medium coded with information for enabling communication with a user, wherein the information, once read by the machine, causes the machine to perform a series of steps. Information representing the surroundings of a user to be engaged in a new dialogue is received via a communication platform, wherein the information is acquired from a scene in which the user is present and captures characteristics of the user and the scene. Relevant features are extracted from the information. A state of the user is estimated based on the relevant features, and a dialogue context surrounding the scene is determined based on the relevant features. A topic for the new dialogue is determined based on the user, and feedback is generated to initiate the new dialogue with the user based on the topic, the state of the user, and the dialogue context.

Additional advantages and novel features will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following and the accompanying drawings or may be learned by production or operation of the examples. The advantages of the present teachings may be realized and attained by practice or use of various aspects of the methodologies, instrumentalities and combinations set forth in the detailed examples discussed below.

BRIEF DESCRIPTION OF THE DRAWINGS

The methods, systems and/or programming described herein are further described in terms of exemplary embodiments. These exemplary embodiments are described in detail with reference to the drawings. These embodiments are non-limiting exemplary embodiments, in which like reference numerals represent similar structures throughout the several views of the drawings, and wherein:

FIG. 1 depicts connections among a user device, an agent device, and a user interaction system during a dialogue, in accordance with an embodiment of the present teaching;

FIG. 2 depicts an exemplary high level system diagram for an automated dialogue companion with multiple layer processing capabilities, according to an embodiment of the present teaching;

FIG. 3 illustrates a dialogue process in which contextual information is observed and used to infer the situation in order to devise an adaptive dialogue strategy, according to an embodiment of the present teaching;

FIG. 4 illustrates exemplary multiple layer processing and communications among different processing layers of an automated dialogue companion, according to an embodiment of the present teaching;

FIG. 5 depicts an exemplary high level system diagram of a human machine dialogue framework, according to an embodiment of the present teaching;

FIG. 6 is a flowchart of an exemplary process of a human machine dialogue framework, according to an embodiment of the present teaching;

FIG. 7A illustrates exemplary types of multimodal data that can be acquired during a dialogue, according to an embodiment of the present teaching;

FIG. 7B illustrates exemplary types of user state information, according to an embodiment of the present teaching;

FIG. 7C illustrates exemplary types of conversation environment that may be observed and used as contextual information, according to an embodiment of the present teaching;

FIG. 8 depicts an exemplary high level system diagram of a multimodal data analyzer, according to an embodiment of the present teaching;

FIG. 9 is a flowchart of an exemplary process of a multimodal data analyzer, according to an embodiment of the present teaching;

FIG. 10 depicts an exemplary high level system diagram of a dialogue controller, according to an embodiment of the present teaching;

FIG. 11 is a flowchart of an exemplary process of a dialogue controller, according to an embodiment of the present teaching;

FIG. 12 depicts the architecture of a mobile device which can be used to implement a specialized system incorporating the present teaching; and

FIG. 13 depicts the architecture of a computer which can be used to implement a specialized system incorporating the present teaching.

DETAILED DESCRIPTION

In the following detailed description, numerous specific details are set forth by way of examples in order to provide a thorough understanding of the relevant teachings. However, it should be apparent to those skilled in the art that the present teachings may be practiced without such details or with different details related to design choices or implementation variations. In other instances, well known methods, procedures, components, and/or hardware/software/firmware have been described at a relatively high level, without detail, in order to avoid unnecessarily obscuring aspects of the present teachings.

The present disclosure generally relates to systems, methods, media, and other implementations directed to the illustrated embodiments of the present teaching. The related concepts are presented in the context of a human machine dialogue in which the present teaching may be deployed. Specifically, the present teaching relates to initiating a human machine dialogue adaptively in accordance with the observed surroundings, including the state of the person to be engaged in the dialogue, the settings of the environment, etc. In addition, during the dialogue, the person and the setting of the dialogue are continuously observed, analyzed, and used to adaptively conduct the dialogue accordingly in order to enhance user experience and improve user engagement. Although certain exemplary figures and disclosures are presented to describe the concepts associated with the present teaching, it is understood that the present teaching can be applied to any setting different from what is presented herein without limitation.

FIG. 1 depicts a general framework for human machine dialogue with information flow among different parties, according to an embodiment of the present teaching. As shown, the framework comprises a user device 110, an agent device 160, and a user interaction system 170 that are connected via network connections during a dialogue. The user device may be deployed with various sensors in one or more modalities, e.g., sensor(s) 140 acquiring visual data (images or video), sensor(s) 130 acquiring acoustic data including the utterances of the user or sound from the dialogue environment, sensors acquiring text information (what the user wrote on a display), or haptic sensors (not shown) capturing touch, movement, etc. of the user. Such sensor data may be captured during the dialogue and used to facilitate an understanding of the user (expression, emotion, intent) and the surroundings of the user. Such sensor data provide contextual information and can be exploited to customize a response accordingly.

In a variant embodiment, a human conversant may communicate directly with the user interaction system 170 without the agent device 160. When the agent device is present, it may correspond to a robot which may be controlled by the user interaction system 170. In some embodiments, the user device 110 may communicate directly with the user interaction system 170 without the agent device. In such an embodiment, the functionalities to be performed by the agent device (e.g., to utter a word or a sentence, to convey text information, or to express certain emotions such as frowning, smiling, or appearing sad) may be performed by the user interaction system 170. Thus, although the disclosure below may discuss certain concepts with respect to either the agent device or the user interaction system, the concepts as disclosed may be applied to either or a combination of the two. Moreover, in the embodiments described herein, reference to a machine side may mean either the agent device or the user interaction system, or a combination thereof.

As depicted, connections between any two of the parties in FIG. 1 may be bi-directional. The agent device 160 may interface with a user via a user device 110 to carry out a dialogue in a bi-directional manner. In operation, the agent device 160 may be controlled by the user interaction system 170 to, e.g., utter a response to the user operating the user device 110. According to the present teaching, inputs from the user's side, including, e.g., the user's utterance or action, the appearance of the user, as well as information about the surroundings of the user, are provided to the agent device 160 or to the user interaction system 170 via network connections.

In some embodiments, the agent device 160 or the user interaction system 170 may be configured to process such input and use relevant information identified from such input to dynamically and adaptively generate a response to the user. Such a response includes, e.g., an utterance to convey a verbal response or a change to be made in the agent device to render a different scene to be displayed to the user. For example, the agent device or the user interaction system 170 may observe, from the input from the user device, that the user is wearing a yellow shirt and appears to be bored. An adaptive response from the agent device or the user interaction system may be generated that comments on how nice the user looks. Such a response may re-direct the user's attention to the dialogue. As another example, assume that the response from the machine (either the agent device or the user interaction system) is to render an object, e.g., a tree, on the user device. Based on the sensor input from the user device, it may be determined that the surroundings of the user indicate that it is winter. Knowing that, the machine may adaptively instruct the user device to render the tree without leaves (to be consistent with the surroundings of the user).

As another example, if a machine response is to render a duck on the user device, information about the user's color preference may be retrieved, e.g., from the user information database 130, and rendering instructions may be provided to the user device to customize the duck in the user's preferred color (e.g., yellow or brown). Such customization, based either on on-the-fly observations of the user or the surroundings thereof or on the user's known preferences, is performed adaptively with respect to the user and the current dialogue environment. A minimal sketch of this kind of preference- and context-aware rendering customization is shown below.
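The following is a hedged illustration, not the patent's actual interface: the names UserProfile and customize_render_instruction, and the season argument, are hypothetical stand-ins for the user information database and scene analysis described above.

```python
# Hypothetical sketch of context-aware rendering customization.
# UserProfile and customize_render_instruction are illustrative names only.
from dataclasses import dataclass

@dataclass
class UserProfile:
    user_id: str
    preferred_color: str = "yellow"  # e.g., retrieved from a user information database

def customize_render_instruction(obj: str, profile: UserProfile, season: str) -> dict:
    """Build a rendering instruction adapted to user preference and scene context."""
    instruction = {"object": obj, "color": profile.preferred_color}
    # Keep the rendered object consistent with the observed surroundings,
    # e.g., a tree is rendered without leaves when the scene suggests winter.
    if obj == "tree":
        instruction["leaves"] = season != "winter"
    return instruction

print(customize_render_instruction("duck", UserProfile("u1", "brown"), "winter"))
print(customize_render_instruction("tree", UserProfile("u1"), "winter"))
```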

In this manner, the machine side (either the agent device or the user interaction system, or a combination thereof) may base its decision on how to respond to a user on what is observed during the dialogue. Based on the input from the user device, the machine side may determine the state of the dialogue and the expression/emotion/mindset of the user and generate a response that is based on the specific situation of the dialogue and the intended purpose of the dialogue (e.g., teaching a child English vocabulary) given the observed situation. In some embodiments, if information received from the user device indicates that the user appears to be bored and impatient, the machine side may change the state of a dialogue related to one topic (e.g., geometry in a math education related dialogue) to a different topic that may be of interest to the user, e.g., basketball. Such a switch of topic may be determined based on an observation that the user is gazing at a basketball in the room. The switch of topic thereby continues to engage the user in the conversation.

In some embodiments, the user device may be configured to process raw sensor data acquired in different modalities and send the processed information (e.g., relevant features of the raw inputs) to the agent device or the user interaction system for further processing. This reduces the amount of data transmitted over the network and enhances communication efficiency.

As shown, during a dialogue between the user and the machine, the user device 110 may continually collect multi-modal sensor data related to the user and his/her surroundings, which may be analyzed to detect any information relevant to the dialogue and used to intelligently control the dialogue in an adaptive manner. This may enhance the user experience and/or engagement. The sensor data provide contextual information surrounding the dialogue and enable the machine to understand the situation in order to manage the dialogue more effectively.

FIG. 2 depicts an exemplary high level system diagram for an automated dialogue companion with multiple layer processing capabilities, according to an embodiment of the present teaching. In this illustrated embodiment, the overall system may encompass components/function modules residing in the user device 110, the agent device 160, and the user interaction system 170. The overall system as depicted herein comprises a plurality of layers of processing and hierarchies that together carry out human-machine interactions in an intelligent manner. In the illustrated embodiment, there are five layers: layer 1 for front end applications as well as front end multi-modal data processing; layer 2 for characterizations of the dialogue setting; layer 3, where the dialogue management module resides; layer 4 for the estimated mindsets of different parties (human, agent, device, etc.); and layer 5 for so-called utilities. Different layers may correspond to different levels of processing, ranging from raw data acquisition and processing at layer 1 to processing the changing utilities of participants of dialogues at layer 5. This layering may be summarized as in the sketch below.
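As a purely editorial illustration (the enumeration below is not part of the present teaching), the five-layer organization can be written down compactly:

```python
# An editorial summary of the five-layer organization described above;
# the names are paraphrases of the text, not identifiers from the patent.
from enum import IntEnum

class Layer(IntEnum):
    FRONT_END = 1         # front end applications and raw multi-modal data processing
    DIALOGUE_SETTING = 2  # characterizations of the dialogue setting
    DIALOGUE_MGMT = 3     # the dialogue management module
    MINDSET = 4           # estimated mindsets of the parties (human, agent, device)
    UTILITY = 5           # evolving preferences (utilities) of the participants

for layer in Layer:
    print(f"layer {layer.value}: {layer.name}")
```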

The term “utility” is hereby defined as the preferences of a party identified based on detected states, which are associated with dialogue histories. A utility may be associated with a party in a dialogue, whether the party is a human, the automated companion, or another intelligent device. A utility for a particular party may represent different states of a world, whether physical, virtual, or even mental. For example, a state may be represented as a particular path along which a dialogue walks through a complex map of the world. At different instances, a current state evolves into a next state based on the interaction between multiple parties. States may also be party dependent, i.e., when different parties participate in an interaction, the states arising from such interaction may vary. A utility associated with a party may be organized as a hierarchy of preferences, and such a hierarchy of preferences may evolve over time based on the choices the party has made and the likings exhibited during conversations. Such preferences, which may be represented as an ordered sequence of choices made out of different options, are what is referred to as utility. The present teaching discloses a method and system by which an intelligent automated companion is capable of learning, through a dialogue with a human conversant, the user's utility. A minimal sketch of such a structure follows.
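The sketch below models a utility as an evolving, ordered ranking over observed choices, assuming a simple count-based ordering; the class and method names are illustrative, not from the present teaching.

```python
# A minimal sketch of "utility" as an ordered preference structure learned
# from choices a party makes during dialogues; all names are assumptions.
from collections import Counter

class Utility:
    """Tracks a party's preferences as an evolving ranking over options."""

    def __init__(self):
        self._choices: dict[str, Counter] = {}

    def record_choice(self, topic: str, option: str) -> None:
        # Each observed choice (or exhibited liking) updates the hierarchy.
        self._choices.setdefault(topic, Counter())[option] += 1

    def preferences(self, topic: str) -> list[str]:
        # An ordered sequence of choices made out of different options.
        return [opt for opt, _ in self._choices.get(topic, Counter()).most_common()]

u = Utility()
u.record_choice("activity", "basketball")
u.record_choice("activity", "basketball")
u.record_choice("activity", "reading")
print(u.preferences("activity"))  # ['basketball', 'reading']
```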

Within the overall system as depicted in FIG. 2, the front end applications as well as the front end multi-modal data processing in layer 1 may reside in the user device 110 and/or the agent device 160. For example, the camera, microphone, keyboard, display, renderer, speakers, chat-bubble, and user interface elements may be components or functional modules of the user device. For instance, there may be an application or client running on the user device which may include the functionalities before an external application interface (API) as shown in FIG. 2. In some embodiments, the functionalities beyond the external API may be considered as the backend system or may reside in the user interaction system 170. The application running on the user device may take multi-modal data (audio, images, video, text) from the sensors or circuitry of the user device, process the multi-modal data to generate text or other types of signals (objects such as a detected user face, a speech understanding result) representing features of the raw multi-modal data, and send them to layer 2 of the system.

In layer 1, multi-modal data may be acquired via sensors such as a camera, microphone, keyboard, display, speakers, chat bubble, renderer, or other user interface elements. Such multi-modal data may be analyzed to estimate or infer various features that may be used to infer higher level characteristics such as expression, characters, gesture, emotion, action, attention, intent, etc. Such higher level characteristics may be obtained by processing units at layer 2 and then used by components of higher layers, via the internal API as shown in FIG. 2, to, e.g., intelligently infer or estimate additional information related to the dialogue at higher conceptual levels. For example, the estimated emotion, attention, or other characteristics of a participant of a dialogue obtained at layer 2 may be used to estimate the mindset of the participant. In some embodiments, such mindset may also be estimated at layer 4 based on additional information, e.g., the recorded surrounding environment or other auxiliary information in such surrounding environment such as sound.

The estimated mindsets of parties, whether related to humans or the automated companion (machine), may be relied on by the dialogue management at layer 3 to determine, e.g., how to carry on a conversation with a human conversant. How each dialogue progresses often represents a human user's preferences. Such preferences may be captured dynamically during the dialogue at the utility layer (i.e., layer 5). As shown in FIG. 2, utilities at layer 5 represent evolving states that are indicative of parties' evolving preferences, which can also be used by the dialogue management at layer 3 to decide the appropriate or intelligent way to carry on the interaction.

Sharing of information among different layers may be accomplished via APIs. In some embodiments as illustrated in FIG. 2, information sharing between layer 1 and the rest of the layers is via an external API, while sharing information among layers 2-5 is via an internal API. It is understood that this is merely a design choice and other implementations are also possible to realize the present teaching presented herein. In some embodiments, through the internal API, various layers (2-5) may access information created by or stored at other layers to support the processing. Such information may include a common configuration to be applied to a dialogue (e.g., the character of the agent device is an avatar, a preferred voice, or a virtual environment to be created for the dialogue, etc.), a current state of the dialogue, a current dialogue history, known user preferences, estimated user intent/emotion/mindset, etc. In some embodiments, some information that may be shared via the internal API may be accessed from an external database. For example, certain configurations related to a desired character for the agent device (e.g., a duck) may be accessed from, e.g., an open source database that provides parameters (e.g., parameters to visually render the duck and/or parameters needed to render the speech of the duck). A sketch of such a shared record is given below.
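The kind of shared configuration and dialogue state that layers 2-5 might exchange via the internal API can be sketched as a simple record; the field names below are assumptions for illustration, not the patent's data model.

```python
# Hedged sketch of shared configuration/state exchanged across layers 2-5;
# all field names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class DialogueSharedState:
    agent_character: str = "avatar"       # e.g., a duck persona for the agent device
    preferred_voice: str = "default"
    virtual_environment: str = "none"
    dialogue_state: str = "new"           # current state of the dialogue
    history: list[str] = field(default_factory=list)
    user_preferences: dict = field(default_factory=dict)
    estimated_mindset: str = "unknown"    # filled in by layer 4

shared = DialogueSharedState(agent_character="duck", preferred_voice="child_friendly")
shared.history.append("agent: How are you doing today?")
print(shared)
```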

FIG. 3 illustrates a dialogue process in which contextual information is observed and used to infer the situation in order to devise an adaptive dialogue strategy, according to an embodiment of the present teaching. As seen from FIG. 3, operations at different layers may be conducted, and together they facilitate intelligent dialogue in a cooperative manner. In the illustrated example, while observing that the user is in a normal mood, the agent device 160 may first ask the user “How are you doing today?” at 302 to initiate a conversation. In some embodiments as disclosed herein, if the user appears to be in a bad mood, the machine (either the agent device 160 or the user interaction system 170) may initiate using a different sentence, e.g., “Are you okay?” (not shown). In response to the utterance at 302, the user may respond with the utterance “Ok” at 304. To manage the dialogue, the automated dialogue machine may activate different sensors for capturing the dynamic situation during the dialogue to enable observation of the user and the surrounding environment. For example, the automated dialogue machine may activate multi-modal sensors to acquire multimodal data. Such multi-modal data may include audio, visual, or text data. For example, visual data may capture the facial expression of the user. The visual data may also reveal contextual information surrounding the scene of the conversation. A picture of the scene may reveal that there is a basketball, a table, and a chair, which provides information about the environment and may be leveraged in dialogue management to enhance engagement of the user. Audio data may capture not only the speech response of the user but also other peripheral information such as the tone of the response, the manner by which the user utters the response, or the accent of the user.

Based on the acquired multi-modal data, analysis may be performed by the automated dialogue machine (e.g., by the front end user device or by the backend user interaction system 170) to assess the attitude, emotion, mindset, and utility of the user. For example, based on visual data analysis, the automated dialogue machine may detect that the user appears sad and is not smiling, and that the user's speech is slow with a low voice. The characterization of the user's state in the dialogue may be performed at layer 2 based on multi-modal data acquired at layer 1. Based on such detected observations, the automated companion may infer (at 306) that the user is not that interested in the current topic and not that engaged. Such an inference of the emotion or mental state of the user may, for instance, be performed at layer 4 based on a characterization of the multi-modal data associated with the user.

To respond to the user's current state (appearing not engaged), the automated dialogue machine may determine to perk up the user in order to better engage the user. In this illustrated example, the automated dialogue machine may leverage what is available in the conversation environment by uttering a question to the user at 308: “Would you like to play a game?” Such a question may be delivered in audio form as speech by converting text to speech, e.g., using customized voices individualized for the user. In this case, the user may respond by uttering, at 310, “Ok.” Based on the continuously acquired multi-modal data related to the user, it may be observed, e.g., via processing at layer 2, that in response to the invitation to play a game, the user's eyes appear to be wandering, and in particular that the user's eyes may gaze towards where the basketball is located. At the same time, the automated companion may also observe that, upon hearing the suggestion to play a game, the user's facial expression changes from “sad” to “smiling.” Based on such observed characteristics of the user, the automated companion may infer, at 312, that the user is interested in basketball.

Based on the acquired new information and the inference made based on it, the automated companion may decide to leverage the basketball available in the environment to make the dialogue more engaging for the user while still achieving the educational goal for the user. In this case, the dialogue management at layer 3 may adapt the conversation to talk about a game and leverage the observation that the user gazed at the basketball in the room to make the dialogue more interesting to the user while still achieving the goal of, e.g., educating the user. In one example embodiment, the automated companion generates a response, suggesting that the user play a spelling game (at 314) and asking the user to spell the word “basketball.”

Given the adaptive dialogue strategy of the automated companion in light of the observations of the user and the environment, the user may respond by providing the spelling of the word “basketball” (at 316). Observations are continuously made as to how enthusiastic the user is in answering the spelling question. If the user appears to respond quickly with a brighter attitude, determined based on, e.g., multi-modal data acquired when the user is answering the spelling question, the automated companion may infer, at 318, that the user is now more engaged. To further encourage the user to actively participate in the dialogue, the automated companion may then generate a positive response “Great job!” with an instruction to deliver this response in a bright, encouraging, and positive voice to the user.

FIG. 4 illustrates exemplary multiple layer processing and communications among different processing layers of an automated dialogue companion, according to an embodiment of the present teaching. The dialogue manager 410 in FIG. 4 corresponds to a functional component of the dialogue management at layer 3. A dialogue manager is an important part of the automated companion, and it manages dialogues. Traditionally, a dialogue manager takes in as input a user's utterances and determines how to respond to the user. This is performed without taking into account the user's preferences, the user's mindset/emotions/intent, or the surrounding environment of the dialogue, i.e., without giving any weight to the different available states of the relevant world. The lack of an understanding of the surrounding world often limits the perceived authenticity of, and engagement in, the conversations between a human user and an intelligent agent.

In some embodiments of the present teaching, the utilities of the parties of a conversation relevant to an on-going dialogue are exploited to allow a more personalized, flexible, and engaging conversation to be carried out. This facilitates an intelligent agent acting in different roles to become more effective in different tasks, e.g., scheduling appointments, booking travel, ordering equipment and supplies, and researching online on various topics. When an intelligent agent is aware of a user's dynamic mindset, emotions, intent, and/or utility, it enables the agent to engage a human conversant in the dialogue in a more targeted and effective way. For example, when an education agent teaches a child, the preferences of the child (e.g., a color he loves), the emotion observed (e.g., sometimes the child does not feel like continuing the lesson), and the intent (e.g., the child is reaching out to a ball on the floor instead of focusing on the lesson) may all permit the education agent to flexibly adjust the subject of focus to toys, and possibly the manner by which to continue the conversation with the child, so that the child may be given a break in order to achieve the overall goal of educating the child.

As another example, the present teaching may be used to enhance a customer service agent in its service by asking questions that are more appropriate given what is observed in real-time from the user, hence achieving improved user experience. This is rooted in the essential aspects of the present teaching as disclosed herein: developing the means and methods to learn and adapt to the preferences or mindsets of the parties participating in a dialogue so that the dialogue can be conducted in a more engaging manner.

The dialogue manager (DM) 410 is a core component of the automated companion. As shown in FIG. 4, DM 410 (layer 3) takes input from different layers, including input from layer 2 as well as input from higher levels of abstraction such as the estimated mindset from layer 4 and utilities/preferences from layer 5. As illustrated, at layer 1, multi-modal information is acquired from sensors in different modalities and is processed to, e.g., obtain features that characterize the data. This may include signal processing in visual, acoustic, and textual modalities.

The processed features of the multi-modal data may be further processed at layer 2 to achieve language understanding and/or multi-modal data understanding, including visual, textual, and any combination thereof. Some of such understanding may be directed to a single modality, such as speech understanding, and some may be directed to an understanding of the surroundings of the user engaging in the dialogue based on integrated information. Such understanding may be physical (e.g., recognizing certain objects in the scene), perceivable (e.g., recognizing what the user said, or a certain significant sound, etc.), or mental (e.g., a certain emotion such as stress of the user estimated based on, e.g., the tone of the speech, a facial expression, or a gesture of the user).

The multi-modal data understanding generated at layer 2 may be used by DM 410 to determine how to respond. To enhance engagement and user experience, the DM 410 may also determine a response based on the estimated mindset of the user from layer 4 as well as the utilities of the user engaged in the dialogue from layer 5. An output of DM 410 corresponds to an accordingly determined response to the user. To deliver a response to the user, the DM 410 may also formulate a way that the response is to be delivered. The form in which the response is to be delivered may be determined based on information from multiple sources, e.g., the user's emotion (e.g., if the user is a child who is not happy, the response may be rendered in a gentle voice), the user's utility (e.g., the user may prefer speech in a certain accent similar to his parents'), or the surrounding environment that the user is in (e.g., a noisy place, so that the response needs to be delivered at a high volume). DM 410 may output the determined response together with such delivery parameters, as sketched below.
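A hedged sketch of a DM output that pairs the response text with delivery parameters derived from emotion, utility, and surroundings; the names DeliveryParams and package_response, and the rules, are hypothetical.

```python
# Illustrative sketch (not the patent's interface) of a response paired
# with delivery parameters; names and rules are assumptions.
from dataclasses import dataclass

@dataclass
class DeliveryParams:
    voice_style: str = "neutral"   # e.g., "gentle" for an unhappy child
    accent: str = "default"        # e.g., matched to a known user preference
    volume: float = 0.5            # raised in a noisy environment

def package_response(text: str, emotion: str, noisy: bool) -> tuple[str, DeliveryParams]:
    params = DeliveryParams()
    if emotion in ("sad", "upset"):
        params.voice_style = "gentle"
    if noisy:
        params.volume = 0.9
    return text, params

print(package_response("Great job!", emotion="happy", noisy=True))
```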

In some embodiments, the delivery of such a determined response is achieved by generating the deliverable form(s) of the response in accordance with various parameters associated with the response. In a general case, a response is delivered in the form of speech in some natural language. A response may also be delivered in speech coupled with a particular nonverbal expression as a part of the delivered response, such as a nod, a shake of the head, a blink of the eyes, or a shrug. There may also be deliverable forms of a response that are acoustic but not verbal, e.g., a whistle.

To deliver a response, a deliverable form of the response may be generated via, e.g., verbal response generation and/or behavior response generation, as depicted in FIG. 4. Such a response in its determined deliverable form(s) may then be used by a renderer to actually render the response in its intended form(s). For a deliverable form in a natural language, the text of the response may be used to synthesize a speech signal via, e.g., text to speech techniques, in accordance with the delivery parameters (e.g., volume, accent, style, etc.). For any response or part thereof that is to be delivered in a non-verbal form(s), e.g., with a certain expression, the intended non-verbal expression may be translated into, e.g., via animation, control signals that can be used to control certain parts of the agent device (the physical representation of the automated companion) to perform certain mechanical movements to deliver the non-verbal expression of the response, e.g., nodding the head, shrugging the shoulders, or whistling. In some embodiments, to deliver a response, certain software components may be invoked to render a different facial expression of the agent device. Such rendition(s) of the response may also be simultaneously carried out by the agent (e.g., speaking a response with a joking voice and with a big smile on the face of the agent).

FIG. 5 depicts an exemplary high level system diagram of the dialogue manager 410 for personalized dialogue management, according to an embodiment of the present teaching. In this illustrated embodiment, the dialogue manager 410 comprises a multimodal data analysis unit 510, a user state estimator 520, a dialogue contextual info determiner 530, a dialogue controller 540, a feedback instruction generator 550, and a dialogue response output unit 560. The dialogue manager 410 as depicted herein controls both how to intelligently initiate a conversation with a user based on what is observed and how to respond to a user during a conversation based on what is observed.

The multimodal data to be analyzed by the multimodal data analysis unit 510 may include acoustic signals, which may correspond to environmental sound or speech; visual signals, which may include video and picture images; text information; or information from a haptic sensor that characterizes, e.g., movement of the fingers, hands, head, or other parts of the user's body. This is illustrated in FIG. 7A.

The user state estimator 520 and the dialogue contextual info determiner 530 estimate, based on the processed multimodal data from the multimodal data analysis unit 510, the user's state and the contextual information surrounding the dialogue scene, respectively. FIG. 7B illustrates exemplary types of user state information, according to an embodiment of the present teaching. A user's state may include the physical appearance of the user, including skin color, hair color/style, facial tone, eye color, clothing (color, style), . . . , etc. A user's state may also relate to the expression of the user, such as smiling, frowning, winking, confusion, loudness of the voice, . . . , etc. The user's state may also be related to what the user said and in what manner. Speech recognition may be performed based on the audio signal to determine the text of the user's utterance. The pitch and tone of the speech may also be recognized, which may be used, together with visual cues from the visual data, to estimate the user's expression or emotion. The expressions of a user may be used to infer his/her emotion, such as happy, sad, upset, bored, . . . , etc. In some situations, expressions/emotions may also be used to infer intent. For example, if a user's expression indicates that he/she is bored but his/her eyes gaze at a ball nearby, it may be inferred that the user is interested in talking about a topic related to the ball (e.g., a basketball game), which is related to intent.

To facilitate the user state estimator 520, appropriate models may be provided in 515. The estimation of the user's state may be performed based on different models. In the illustrated embodiments, the user state detection models may include model(s) for detecting features of different aspects related to the user, including but not limited to appearance (the color of the clothes the user is wearing, skin color/tone, gender, hair color, eye color, etc.), facial expressions (smiling, frowning, crying, sad, . . . , etc.), emotions (sad, happy, excited, angry, . . . , etc.), and moods (upset, indifferent, interested, distracted, . . . , etc.). Based on such models, different information or features may be extracted and utilized to make a determination as to the state of the user, as illustrated in the sketch below.
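For illustration only, a toy user state estimator might combine extracted features as follows; a deployed system would use the trained user state detection models 515 rather than these hand-written rules, and the feature keys are assumptions.

```python
# A simplified, assumption-laden sketch of estimating a user state from
# extracted multimodal features; feature keys and rules are illustrative.
def estimate_user_state(features: dict) -> dict:
    state = {
        "appearance": {k: features[k] for k in ("clothing_color", "hair_color")
                       if k in features},
        "expression": features.get("facial_expression", "neutral"),
    }
    # Combine vocal and facial cues into a coarse emotion estimate.
    if features.get("facial_expression") == "frowning" or features.get("pitch") == "low":
        state["emotion"] = "sad"
    elif features.get("facial_expression") == "smiling":
        state["emotion"] = "happy"
    else:
        state["emotion"] = "neutral"
    return state

print(estimate_user_state({"clothing_color": "yellow",
                           "facial_expression": "frowning", "pitch": "low"}))
```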

To estimate the contextual information surrounding the dialogue scene, the dialogue contextual info determiner 530 detects, based on the environment detection models 525, various objects present in the scene captured by the multimodal data and extracts relevant features thereof. To facilitate the determination of the contextual information of the dialogue environment that the user is in, the environment detection models 525 may be appropriately provided for detecting different types of scenes, different types of objects, different characterizations of environments, etc. FIG. 7C illustrates exemplary types of contextual information and/or environment detection models, according to an embodiment of the present teaching. As seen, a dialogue environment may imply a hierarchy of concepts at different conceptual levels, ranging from lower level concepts such as objects present in the dialogue environment (computer, desk, table, ball, chair, cabinet, tree, lawn, sky, lake, children, swing, slide, mountain, . . . , etc.) and places (office, park, beach, wilderness, store, playground, . . . , etc.), to the nature of a place (e.g., vacation place, work place, transit place, . . . , etc.). Using such models, the dialogue contextual info determiner 530 may detect not only what is present in the scene around the user but also a characterization of the place and the nature of the place with respect to the user. A toy version of this hierarchy is sketched below.
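The object-to-place-to-nature hierarchy can be illustrated with a small lookup standing in for the environment detection models 525; the tables and function below are assumptions, not the patent's models.

```python
# Illustrative sketch of the contextual hierarchy: detected objects are
# lifted to a place, and the place to its nature. The lookup tables are
# stand-ins for the environment detection models 525.
PLACE_SIGNATURES = {
    "office": {"computer", "desk", "chair", "cabinet"},
    "park": {"tree", "lawn", "swing", "slide", "children"},
    "beach": {"sky", "lake", "sand"},
}
PLACE_NATURE = {"office": "work place", "park": "vacation place", "beach": "vacation place"}

def infer_context(objects: set[str]) -> dict:
    # Pick the place whose signature overlaps most with the detected objects.
    place = max(PLACE_SIGNATURES, key=lambda p: len(PLACE_SIGNATURES[p] & objects))
    return {"objects": objects, "place": place, "nature": PLACE_NATURE[place]}

print(infer_context({"desk", "chair", "computer", "ball"}))
```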

Referring back to FIG. 5, the estimated user state (from the user state estimator 520) and the determined contextual information of the underlying dialogue (from the dialogue contextual info determiner 530) may then be utilized by the dialogue controller 540 to either adaptively initiate a conversation with a user or determine how to respond to the last response from the user. The determination of a feedback (either what is to be said to the user to initiate a conversation or a response to the last utterance from the user) is also made based on the program 535 currently applied to the conversation (e.g., an education session on math), the dialogue tree 545 (which provides a general flow of the conversation), and user information such as preferences from the user profile archive 555. Such generated feedback is then processed at the feedback instruction generator 550 to generate instructions for the feedback (e.g., a rendering instruction to be sent to the user device to render the machine's response), which is then sent to the user device via the dialogue response output unit 560.

FIG. 6 is a flowchart of an exemplary process of the dialogue manager 410, according to an embodiment of the present teaching. When the multimodal data analysis unit 510 receives, at 610, multimodal sensor data from a user device, it analyzes the multimodal sensor data at 620. Based on the analyzed multimodal sensor data, the user state estimator 520 estimates, at 630, the state of the user based on the user state detection models 515. As discussed herein, the state of the user includes characterizations of the user at different conceptual levels, some at the appearance level, some emotional, etc.

Based on the analyzed multimodal sensor information, the dialogue contextual info determiner 530 determines, at 640, contextual information of the dialogue. In order to determine a feedback to the user (either the initial utterance to be said to the user or a response responding to the last utterance of the user), the dialogue controller 540 determines, at 650, a program/topic associated with the current dialogue. From the program so determined, the dialogue controller 540 may access an appropriate dialogue tree corresponding to the program and identify the node in the dialogue tree corresponding to the current conversation. Based on the dialogue tree, the user state, and the contextual information, the dialogue controller 540 determines, at 660, the progression of the dialogue. There are two possible dialogue scenarios. One is that the dialogue session is at the very beginning, e.g., the user just appeared and the dialogue controller 540 needs to initiate a dialogue according to what is observed. The other scenario is that the dialogue is on-going, so that the dialogue controller 540 is to determine how to respond to the user.

For a new conversation, wherein the user newly appears, the dialogue controller 540 is to initiate the conversation. In this situation, the identified dialogue tree is at its initial node, and the dialogue controller 540 is to determine whether what is dictated by the first node of the dialogue tree is appropriate given the estimated user state. For example, if the first utterance of the machine according to the dialogue tree is “Are you ready for today's program?” but the estimated user state indicates that the user is currently really upset, the dialogue controller 540 may determine not to adopt what is dictated in the dialogue tree and instead generate a different initiating sentence based on, e.g., the user's preferences from the user profile archive 555 and/or something detected in the dialogue scene, such as the user's appearance or some objects in the environment of the user. For instance, if the user is known to love to play basketball (from the user profile archive 555) and there is a ball observed in the scene (contextual information from the dialogue contextual info determiner 530), the dialogue controller 540 may determine to initiate the conversation by saying “Would you like to play ball for a while?” to engage the user. This decision is sketched below.
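The initiation decision just described might be sketched as follows, assuming simple representations of the dialogue tree's first node, the user state, and the scene; all names and the override rule are illustrative.

```python
# A hedged sketch of the initiation decision: follow the dialogue tree's
# first node unless the estimated user state argues against it, in which
# case an opener is composed from profile and scene cues.
def choose_opening(tree_first_utterance: str, user_state: dict,
                   preferences: list[str], scene_objects: set[str]) -> str:
    if user_state.get("emotion") not in ("upset", "sad"):
        return tree_first_utterance
    # Override the scripted opener with something grounded in the scene.
    for liked in preferences:
        if liked in scene_objects:
            return f"Would you like to play with the {liked} for a while?"
    return "Are you okay?"

print(choose_opening("Are you ready for today's program?",
                     {"emotion": "upset"}, ["basketball"], {"basketball", "desk"}))
```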

If the dialogue is an on-going session, the dialogue controller 540 may determine a response based on, e.g., whether the user state is appropriate for following the dialogue tree of the current dialogue. If the estimated user state suggests that it is not appropriate to follow the dialogue tree, e.g., the last utterance from the user does not correspond to any of the predicted answers in the dialogue tree and the user appears to be lost, the dialogue controller 540 may determine not to follow the dialogue tree and instead identify some alternative response. Such an alternative response may be determined adaptively according to the user state and the contextual information. For instance, the user state may indicate that the user is wearing a swimming suit, and the contextual information may indicate that the user is near a swimming pool. To engage the user in an English language art education dialogue session, the dialogue controller 540 may determine to engage the user in the program by responding “Would you let me know how to spell the words ‘swim’ and ‘pool’?” Adapting the dialogue in accordance with the user state (the lost emotion, the appearance of wearing a swimming suit) and the context of the dialogue (there is a pool in the environment) to identify an alternative response may re-orient the user's attention and improve engagement and user experience. A sketch of this fallback follows.
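Under the same illustrative assumptions, the on-going case can be sketched as a fallback around the dialogue tree's predicted answers; the rule and names are hypothetical.

```python
# Sketch of the on-going case: the user's utterance is matched against the
# answers predicted by the current dialogue tree node; on a miss with a
# lost-looking user, a context-based response is improvised instead of
# forcing the tree forward.
def choose_response(utterance: str, predicted_answers: set[str],
                    user_state: dict, context: dict) -> str:
    if utterance in predicted_answers:
        return "Great, let's continue."      # normal tree progression
    if user_state.get("emotion") == "lost" and "pool" in context.get("objects", set()):
        return "Would you let me know how to spell the words 'swim' and 'pool'?"
    return "Let's try that once more."

print(choose_response("huh?", {"yes", "no"},
                      {"emotion": "lost"}, {"objects": {"pool", "chair"}}))
```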

Based on the feedback that the dialogue controller 540 determined (either how to initiate the dialogue or how to respond to the user in an on-going dialogue), the feedback instruction generator 550 generates, at 670, instructions for the user device that can be used to render the feedback to the user and sends the instructions to the dialogue response output unit 560, which then sends, at 680, the instructions to the user device.

FIG. 8 depicts an exemplary high level system diagram of the multimodal data analysis unit 510, according to an embodiment of the present teaching. As discussed herein, the multimodal data analysis unit 510 is for analyzing the multimodal sensor data, extracting relevant information, and identifying useful cues that can be used in adapting the dialogue. Such cues include both features related to the user and features related to the environment. Based on such multimodal features (visual, acoustic, text, or haptic), the multimodal data analysis unit 510 may also infer, e.g., based on models, higher level concepts such as expressions, emotions, intent, etc. In the illustrated embodiment, the multimodal data analysis unit 510 comprises a raw data processing portion, a feature extraction portion, and an inference portion.

The raw data processing portion comprises an audio information processor 800, a visual information processor 805, and a text information processor 810. Additional raw sensor data processing units may also be included, such as a processor for processing, e.g., haptic information (not shown). The processors for processing raw data may perform low level signal processing for, e.g., removing noise, enhancing picture quality, or sub-sampling to reduce redundant data.

The feature extraction portion may take the output from the raw data processing units as input and extract features (face, eyes, skin, sky, tree, words from speech or text, measures related to the speech such as speed, clarity, loudness, pitch, . . . , etc.) relevant to certain characteristics of interest (e.g., expressions such as a smile, appearances such as eye color or skin tone/color, surrounding objects such as a desk, tree color, a pool, . . . , etc.). This portion may include an acoustic sound analyzer 815 (for analyzing environmental sounds), a speech recognition unit 820 (for recognizing the utterances of the user and/or of others), a user facial feature detector 825 (for detecting face(s), facial features, and facial expressions), a user appearance analyzer 830 (for analyzing, e.g., clothing, skin color, etc.), and a surrounding feature detector 835 (for detecting, e.g., different types of objects that appear in the dialogue environment, etc.). These components detect features in their respective modalities, and such detected features are sent to the inference portion to infer higher level concepts such as emotion, intent, etc.

The inference portion in this illustrated embodiment includes an emotion estimator 840, which may infer the emotion of the user based on visual and acoustic features from components 815, 820, 825, and 830, and an intent estimator 850, which may infer intent based not only on the visual and acoustic features from these four components but also on the estimated emotion from the emotion estimator 840. For instance, the user facial feature detector 825 may have detected that the user is gazing in a certain direction where there is a ball, and the emotion estimator 840 may infer that the user is confused about the substance of the dialogue. Based on such visual cues, the intent estimator 850 may infer that the user's attention is not on the dialogue and that the user desires (intends) to play with the ball. Such inferred emotion and intent can be used by the dialogue controller 540 to determine how to continue the dialogue and re-orient the user's attention, as discussed herein. In some embodiments, higher level inference such as estimating emotions or intent may be implemented in the user state estimator 520. The fusion step is sketched below.
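A toy sketch of the intent estimator 850's fusion step, combining a gaze target with the estimated emotion; the rule is an assumption for illustration, and the present teaching does not prescribe a particular estimator.

```python
# Illustrative fusion of a visual cue (gaze target) with the estimated
# emotion to suggest an intent; the rule below is an assumption.
from typing import Optional

def estimate_intent(gaze_target: Optional[str], emotion: str) -> str:
    if emotion in ("confused", "bored") and gaze_target:
        # Attention has drifted toward an object; infer a desire to engage it.
        return f"play_with_{gaze_target}"
    return "continue_dialogue"

print(estimate_intent("ball", "confused"))   # play_with_ball
print(estimate_intent(None, "happy"))        # continue_dialogue
```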

The multimodal data analysis unit 510 may also include a relevant cue identifier 860, which takes input from all components in both the feature extraction portion and the inference portion (i.e., 815, 820, 825, 830, 835, 840, and 850) and identifies cues that are relevant to the dialogue based on, e.g., the dialogue models/trees 837. For instance, if a dialogue is about a travel plan, certain features related to the user (e.g., eye color) or present in the dialogue scene (e.g., a toy duck on a desk) may not be relevant to the dialogue management. In this case, the relevant cue identifier 860 may filter such features out without sending them to the user state estimator 520 or the dialogue contextual info determiner 530 (see FIG. 5). A sketch of such filtering follows.
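The filtering performed by the relevant cue identifier 860 might be sketched with a topic-to-relevance lookup standing in for the dialogue models/trees 837; the table below is an assumption.

```python
# Minimal sketch of relevance filtering: features are kept only if the
# active dialogue model declares their type relevant. The relevance
# table stands in for the dialogue models/trees 837.
DIALOGUE_RELEVANCE = {
    "travel_plan": {"speech_text", "emotion", "place"},
    "education": {"speech_text", "emotion", "gaze_target", "scene_object"},
}

def filter_cues(cues: list[tuple[str, str]], dialogue_topic: str) -> list[tuple[str, str]]:
    relevant_types = DIALOGUE_RELEVANCE.get(dialogue_topic, set())
    return [(ctype, value) for ctype, value in cues if ctype in relevant_types]

cues = [("eye_color", "brown"), ("scene_object", "toy duck"), ("emotion", "bored")]
print(filter_cues(cues, "travel_plan"))  # [('emotion', 'bored')]
```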

FIG. 9 is a flowchart of an exemplary process of the multimodal data analysis unit 510, according to an embodiment of the present teaching. Upon receiving the multimodal data at 910, components in the multimodal data analysis unit 510 process different information to respectively extract and identify relevant information. For example, as seen in FIG. 8, to identify useful cues in the audio domain, the audio information processor 800, the acoustic sound analyzer 815, and the speech recognition unit 820 process, at 940, the audio signal to recognize environmental sounds and/or the speech of the user. To identify useful cues in the visual domain, the visual information processor 805, the user facial feature detector 825, and the user appearance analyzer 830 process, at 930, the visual data to identify visual features of the user (such as eyes, nose, skin color, etc.), facial expressions, and user appearance such as hair, clothing, etc. To identify useful cues in the surrounding environment of the user, the surrounding feature detector 835 processes, at 920, data from multiple domains to extract relevant features.

Based on the detected audio and visual features, the emotion estimator 840 estimates, at 950, the emotion of the user. The intent estimator 850 estimates, at 960, the intent of the user based on the audio/visual features and/or the estimated emotion of the user. The relevant cue identifier 860 then filters the features and estimates to identify, at 970, relevant visual/audio/environmental cues and sends, at 980, such relevant cues to different components in the dialogue manager 410 (see FIG. 5).

FIG. 10 depicts an exemplary high level system diagram of the dialogue controller 540, according to an embodiment of the present teaching. In this illustrated embodiment, the dialogue controller 540 comprises a user information processor 1000, a contextual info processor 1010, a dialogue progression determiner 1040, a dialogue initiation determiner 1030, a dialogue response determiner 1020, and a dialogue feedback generator 1050. As discussed herein, the dialogue controller 540 determines whether it is to initiate a dialogue or generate a response to the user in an on-going dialogue session. In each situation, the dialogue controller 540 is to produce feedback, where the feedback is thus either an initial greeting directed to the user in the event of starting a new dialogue or a response generated to respond to the user in an on-going dialogue. In either situation, the feedback is generated intelligently and adaptively based on the user state and the contextual information from the scene.

FIG. 11 is a flowchart of an exemplary process of the dialogue controller 540, according to an embodiment of the present teaching. In operation, the user information processor 1000 receives, at 1100, user state information (from the user state estimator 520 in FIG. 5) and the user utilities/preferences from layer 5 (see FIGS. 2 and 4). Based on the received information, the user information processor 1000 analyzes, at 1110, such information to identify different types of relevant information, which may include, e.g., the current mental state of the user (bored), the utilities/preferences of the user (likes to intermingle fun things with learning in order to be effective), the appearance of the user (wearing a yellow swimming suit and a red swimming cap), that the scheduled program to cover is English, and/or that the current topic of the user's scheduled program is the spelling of simple words in English (if it is an on-going dialogue), . . . , etc. Such relevant information about the user from different sources may be further exploited to determine the dialogue initiation/response strategy.

In addition, the contextual info processor 1010 receives, at 1120, contextual information from the dialogue contextual info determiner 530 (see FIG. 5) and analyzes, at 1130, the received contextual information to identify relevant contextual information. For instance, the contextual info processor 1010 may rely on received contextual information that identifies certain objects in the scene (e.g., desk, chair, computer, . . . , bookcase) to conclude that the user is in an office. In other situations, if the received contextual information reports other types of objects such as trees, blue sky, benches, swing, etc., the contextual info processor 1010 may estimate that the user is in a park. Such conclusions based on the contextual information may also be explored in determining the dialogue initiation/response strategy.
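
A minimal sketch of the kind of conclusion drawn here, assuming the detected objects arrive as a set of string labels; the two-object threshold and the object vocabularies are arbitrary illustrations, not part of the present teaching.

    OFFICE_OBJECTS = {"desk", "chair", "computer", "bookcase"}
    PARK_OBJECTS = {"tree", "blue sky", "bench", "swing"}

    def classify_scene(objects: set) -> str:
        # Conclude a scene type when enough characteristic objects appear.
        if len(objects & OFFICE_OBJECTS) >= 2:
            return "office"
        if len(objects & PARK_OBJECTS) >= 2:
            return "park"
        return "unknown"

    # e.g., classify_scene({"desk", "chair", "computer"}) returns "office"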

To determine how to advance the dialogue, depending on whether it is a new dialogue or an on-going dialogue, relevant information such as a state of the dialogue is extracted, at 1140, by the dialogue progression determiner 1040 from, e.g., the user state information and is used to determine, at 1150, the mode of operation accordingly. If it is a new dialogue, the dialogue progression determiner 1040 invokes the dialogue initiation determiner 1030 to determine an initiation strategy. Otherwise, the dialogue progression determiner 1040 invokes the dialogue response determiner 1020 to generate a response strategy.
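
The mode decision at 1150 then amounts to a dispatch of the following form. The strategy functions are stubbed here (and sketched more fully below); treating a missing dialogue state as the mark of a new dialogue is an assumption of this sketch, not a requirement of the present teaching.

    def advance_dialogue(dialogue_state, user_info, context_info):
        # Dialogue progression determiner 1040: no prior dialogue state is
        # taken to mean a new dialogue is being started.
        if dialogue_state is None:
            return determine_initiation_strategy(user_info, context_info)   # via 1030
        return determine_response_strategy(dialogue_state, user_info, context_info)  # via 1020

    # Stubs so the sketch runs; fuller illustrations of both appear below.
    def determine_initiation_strategy(user_info, context_info): return "greeting"
    def determine_response_strategy(state, user_info, context_info): return "response"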

Upon being invoked, the dialogue initiation determiner 1030 analyzes the processing results from both the user information processor 1000 and the contextual info processor 1010 to understand the user and the surroundings of the user. For instance, the user may be a young boy with brown eyes, dark skin, and curly hair, wearing a yellow shirt (from user state information), and known to be a slow warmer (i.e., a shy person) when it comes to meeting and conversing with strangers (from user utilities). The processing results from the contextual info processor 1010 may indicate that the user is currently in an office, which has a desk, a chair, a computer on the desk, and a box labeled as a lego set also on the desk.

Based on what is known about the current user (a slow warmer), the dialogue initiation determiner 1030 may decide, e.g., based on machine learned models, that it may be more appropriate to start the conversation by first talking about something a young boy may like, to get the user comfortable before engaging the user in the intended conversation (e.g., a program that the user has signed up for to learn math). To initiate appropriately with respect to the current user given the contextual information about the surroundings of the user, the dialogue initiation determiner 1030 may generate, at 1160, an individualized dialogue initiation strategy that leverages what is observed in the scene and determines how to initially approach the user. For example, the dialogue initiation determiner 1030 may leverage the lego set observed on the desk and decide to start the conversation by asking "Do you like lego?" Alternatively, the dialogue initiation determiner 1030 may leverage the observed appearance of the user and decide to start the conversation by saying "You look very cheerful today!" Starting a conversation this way is based on a personalized strategy with respect to the user, enabling a better user experience and enhancing engagement. In this way, a dialogue initiation strategy may differ from user to user but is intelligently adaptive to each user.
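
Continuing the sketch, the individualized initiation at 1160 might look as follows. The "slow warmer" label, the field names, and the canned utterances are hypothetical; in the present teaching the choice would come from machine learned models rather than hand-written rules.

    def determine_initiation_strategy(user_info: dict, context_info: dict) -> str:
        # For a slow warmer, warm up with something observed in the scene
        # or in the user's appearance before the intended program topic.
        if user_info.get("personality") == "slow warmer":
            if "lego set" in context_info.get("objects", []):
                return "Do you like lego?"
            if user_info.get("appearance", {}).get("cheerful"):
                return "You look very cheerful today!"
        # Otherwise open directly on the intended program.
        return "Shall we start today's " + user_info.get("program_topic", "session") + "?"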

In terms of determining a response to the user in an on-going dialogue, the dialogue response determiner 1020 may, in a normal situation, determine a response as a traditional dialogue system would, based on, e.g., what is dictated in a relevant dialogue tree. However, a dialogue may not always go as planned, and in some situations the machine agent in the dialogue needs to be more adaptive and creative in order to continue to engage a user without unknowingly annoying the user. When such a situation is detected (e.g., by understanding the situation based on multimodal sensor data and analysis thereof), the dialogue response determiner 1020 may perform functions similar to those of the dialogue initiation determiner 1030 by leveraging relevant information related to the user state and the context of the surrounding environment the user is in.

Compared with determining how to initiate a conversation, determining a response for an on-going dialogue requires additional operational parameters to be considered. For example, because it is an on-going dialogue, there is a dialogue history which may impact the response to be decided. In addition, the current user state (e.g., the user's emotion) associated with a current state of the corresponding dialogue tree may also impact the decision on an appropriate response, whether along the same dialogue tree or outside of the dialogue tree. For example, a user may be in the middle of a conversation with the machine agent on geometry and may have gotten several answers incorrect (as indicated in the dialogue history). The machine agent may also have observed that the user appears to be frustrated. In this situation (which is observable from the dialogue history and the multimodal data acquired from the user), the dialogue response determiner 1020 may determine, based on certain machine learned models (which may dictate that when the user is frustrated due to incorrect answers, it is appropriate to digress in order not to lose the user), that the user needs to be distracted a bit from the current topic.
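
The digress-or-not decision just described, which the present teaching attributes to machine learned models, can be caricatured by a threshold rule over the dialogue history and the estimated emotion. The window size and error count below are arbitrary illustrations, as is the shape of the history records.

    def should_digress(dialogue_history: list, emotion: str) -> bool:
        # Count incorrect answers among the recent turns recorded in the
        # dialogue history; digress when frustration coincides with errors.
        recent_errors = sum(
            1 for turn in dialogue_history[-5:] if turn.get("correct") is False
        )
        return emotion == "frustrated" and recent_errors >= 2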

As to what the machine agent should say to implement the strategy of switching the user's attention to something else, the dialogue response determiner 1020 may leverage what is observed in the scene of the user or the context of the scene. In some situations, the dialogue response determiner 1020 may leverage the detected appearance of the user and devise a distracting response. For instance, if it is observed that the user is wearing a shirt with the text "Red Sox" printed thereon, the dialogue response determiner 1020 may decide to ask the user "Are you a fan of Red Sox?" In some situations, the dialogue response determiner 1020 may leverage what is observed in the scene (context) to devise a needed and relevant response. For example, if the contextual information indicates that the user is presently in an office and the office scene includes a lego set on a desk, the dialogue response determiner 1020 may decide to ask the user "Do you like Lego?" with the intent that it will later use the lego in the scene to continue the discussion on geometry (e.g., "what is the shape of this lego piece?"). Based on such intelligent and dynamic processing, the dialogue response determiner 1020 may determine, at 1170, an appropriate response in consideration of the actual dialogue situation observed. Such a response generated for an on-going dialogue session is accordingly adapted not only to each individual user but also to each situation observed at the moment.
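
A hedged sketch of how the response determiner 1020 might turn such observed cues into a distracting response, reusing should_digress from the previous sketch. The utterances, field names, and the dialogue-tree stub are hypothetical, not the actual implementation of the present teaching.

    def determine_response_strategy(state: dict, user_info: dict, context_info: dict) -> str:
        if should_digress(state.get("history", []), user_info.get("emotion", "")):
            shirt = user_info.get("appearance", {}).get("shirt_text", "")
            if "Red Sox" in shirt:
                return "Are you a fan of Red Sox?"
            if "lego set" in context_info.get("objects", []):
                # The lego can later anchor a return to geometry, e.g.,
                # "what is the shape of this lego piece?"
                return "Do you like Lego?"
        # Normal case: follow the relevant dialogue tree.
        return next_response_from_dialogue_tree(state)

    def next_response_from_dialogue_tree(state):
        # Stub for the conventional dialogue-tree lookup.
        return "Let's continue."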

The feedback, either an initial greeting for the user generated by the dialogue initiation determiner 1030 or a response to the user in an on-going dialogue generated by the dialogue response determiner 1020, is then sent to the dialogue feedback generator 1050, which generates, at 1180, a feedback to be provided to the user and sends, at 1190, such generated feedback to the feedback instruction generator 550 (see FIG. 5), where an appropriate rendering instruction is generated and provided to the user device to render the feedback to the user.

FIG. 12 depicts the architecture of a mobile device which can be used to realize a specialized system, either partially or fully, implementing the present teaching. In this example, the user device on which content is presented and interacted with is a mobile device 1200, including, but not limited to, a smart phone, a tablet, a music player, a handheld gaming console, a global positioning system (GPS) receiver, and a wearable computing device (e.g., eyeglasses, wrist watch, etc.), or in any other form factor. The mobile device 1200 in this example includes one or more central processing units (CPUs) 1240, one or more graphic processing units (GPUs) 1230, a display 1220, a memory 1260, a communication platform 1210, such as a wireless communication module, storage 1290, and one or more input/output (I/O) devices 1250. Any other suitable component, including but not limited to a system bus or a controller (not shown), may also be included in the mobile device 1200. As shown in FIG. 12, a mobile operating system 1270, e.g., iOS, Android, Windows Phone, etc., and one or more applications 1280 may be loaded into the memory 1260 from the storage 1290 in order to be executed by the CPU 1240. The applications 1280 may include a browser or any other suitable mobile apps for receiving and rendering content streams on the mobile device 1200. Communications with the mobile device 1200 may be achieved via the I/O devices 1250.

To implement various modules, units, and their functionalities described in the present disclosure, computer hardware platforms may be used as the hardware platform(s) for one or more of the elements described herein. The hardware elements, operating systems and programming languages of such computers are conventional in nature, and it is presumed that those skilled in the art are adequately familiar therewith to adapt those technologies to the intelligent dialogue initiation as disclosed herein. A computer with user interface elements may be used to implement a personal computer (PC) or other type of work station or terminal device, although a computer may also act as a server if appropriately programmed. It is believed that those skilled in the art are familiar with the structure, programming and general operation of such computer equipment and as a result the drawings should be self-explanatory.

FIG. 13 depicts the architecture of a computing device which can be used to realize a specialized system implementing the present teaching. Such a specialized system incorporating the present teaching has a functional block diagram illustration of a hardware platform which includes user interface elements. The computer may be a general purpose computer or a special purpose computer. Both can be used to implement a specialized system for the present teaching. This computer 1300 may be used to implement any component of the present teaching, as described herein. For example, the dialogue manager, the dialogue controller, etc., may be implemented on a computer such as computer 1300, via its hardware, software program, firmware, or a combination thereof. Although only one such computer is shown, for convenience, the computer functions relating to the present teaching as described herein may be implemented in a distributed fashion on a number of similar platforms, to distribute the processing load.

The computer 1300, for example, includes COM ports 1350 connected to and from a network connected thereto to facilitate data communications. The computer 1300 also includes a central processing unit (CPU) 1320, in the form of one or more processors, for executing program instructions. The exemplary computer platform includes an internal communication bus 1310, program storage and data storage of different forms, e.g., disk 1370, read only memory (ROM) 1330, or random access memory (RAM) 1340, for various data files to be processed and/or communicated by the computer, as well as possibly program instructions to be executed by the CPU. The computer 1300 also includes an I/O component 1360, supporting input/output flows between the computer and other components therein such as user interface elements 1380. The computer 1300 may also receive programming and data via network communications.

Hence, aspects of the methods of intelligent dialogue initiation and/or other processes, as outlined above, may be embodied in programming. Program aspects of the technology may be thought of as "products" or "articles of manufacture" typically in the form of executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Tangible non-transitory "storage" type media include any or all of the memory or other storage for the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide storage at any time for the software programming.

All or portions of the software may at times be communicated through a network such as the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer of the operator of the dialogue system or other systems into the hardware platform(s) of a computing environment or other system implementing a computing environment or similar functionalities in connection with the intelligent dialogue initiation. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to tangible "storage" media, terms such as computer or machine "readable medium" refer to any medium that participates in providing instructions to a processor for execution.

Hence, a machine-readable medium may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, which may be used to implement the system or any of its components as shown in the drawings. Volatile storage media include dynamic memory, such as a main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that form a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a physical processor for execution.

Those skilled in the art will recognize that the present teachings are amenable to a variety of modifications and/or enhancements. For example, although the implementation of various components described above may be embodied in a hardware device, it may also be implemented as a software only solution, e.g., an installation on an existing server. In addition, the intelligent dialogue initiation as disclosed herein may be implemented as a firmware, firmware/software combination, firmware/hardware combination, or a hardware/firmware/software combination.

While the foregoing has described what are considered to constitute the present teachings and/or other examples, it is understood that various modifications may be made thereto and that the subject matter disclosed herein may be implemented in various forms and examples, and that the teachings may be applied in numerous applications, only some of which have been described herein. It is intended by the following claims to claim any and all applications, modifications and variations that fall within the true scope of the present teachings.

We claim:
 1. A method, implemented on a machine having at least one processor, storage, and a communication platform for initiating a dialogue with a user, comprising: receiving, by a dialogue agent via the communication platform, information capturing a user and a surrounding scene, wherein the user is to be engaged in a new dialogue in the scene; estimating a state of the user based on features extracted from the information; selecting, from a plurality of programs, a program on a first topic for the user, wherein the program on the first topic corresponds to an intended dialogue; determining whether the intended dialogue on the first topic associated with the first program is to be initiated given the state of the user; initiating the intended dialogue with the user as the new dialogue, if the intended dialogue is appropriate according to the state of the user; and initiating, if the intended dialogue is currently inappropriate for the user, an alternative dialogue on an alternative topic with the user as the new dialogue, wherein the alternative topic is determined based on the state of the user, observed features of the user, and a dialogue context of the scene determined based on the relevant features.
 2. The method of claim 1, wherein when the state of the user indicates that the user is likely able to engage in the intended dialogue on the first topic, the intended dialogue is considered appropriate in the state of the user; and when the state of the user indicates that the user likely is not able to engage in the intended dialogue on the first topic, the intended dialogue is considered not appropriate in the state of the user.
 3. The method of claim 1, wherein the information includes multimodal sensor data in at least one of audio, visual, textual, and haptic modalities, where the audio sensor data record acoustic sound from the scene and/or a speech from the user.
 4. The method of claim 1, wherein the state of the user characterizes at least one of: an appearance of the user observed in the scene; an expression of the user estimated based on the information acquired from the scene; one or more emotions of the user inferred based on the expression of the user; and an intent of the user inferred based on at least one of the expression and the one or more emotions.
 5. The method of claim 1, wherein the dialogue context includes at least one of: at least one object present in the scene and a characterization thereof; an estimated classification of the scene; a characterization of the scene; and a sound heard from the environment of the scene.
 6. The method of claim 1, wherein the first topic is one of a subject matter related to the program that the user previously signed up for; a subject matter dynamically determined; and a combination thereof.
 7. The method of claim 1, wherein the new dialogue is initiated via at least one of speech and visual means.
 8. Machine readable and non-transitory medium having information recorded thereon for initiating a dialogue with a user, wherein the information, once read by the machine, causes the machine to perform the following steps: receiving, by a dialogue agent via the communication platform, information capturing a user and a surrounding scene, wherein the user is to be engaged in a new dialogue in the scene; estimating a state of the user based on features extracted from the information; selecting, from a plurality of programs, a program on a first topic for the user, wherein the program on the first topic corresponds to an intended dialogue; determining whether the intended dialogue on the first topic associated with the first program is to be initiated given the state of the user; initiating the intended dialogue with the user as the new dialogue, if the intended dialogue is appropriate according to the state of the user; and initiating, if the intended dialogue is currently inappropriate for the user, an alternative dialogue on an alternative topic with the user as the new dialogue, wherein the alternative topic is determined based on the state of the user, observed features of the user, and a dialogue context of the scene determined based on the relevant features.
 9. The medium of claim 8, wherein when the state of the user indicates that the user is likely able to engage in the intended dialogue on the first topic, the intended dialogue is considered appropriate in the state of the user; and when the state of the user indicates that the user likely is not able to engage in the intended dialogue on the first topic, the intended dialogue is considered not appropriate in the state of the user.
 10. The medium of claim 8, wherein the information includes multimodal sensor data in at least one of audio, visual, textual, and haptic modalities, where the audio sensor data record acoustic sound from the scene and/or a speech from the user.
 11. The medium of claim 8, wherein the state of the user characterizes at least one of: an appearance of the user observed in the scene; an expression of the user estimated based on the information acquired from the scene; one or more emotions of the user inferred based on the expression of the user; and an intent of the user inferred based on at least one of the expression and the one or more emotions.
 12. The medium of claim 8, wherein the dialogue context includes at least one of: at least one object present in the scene and a characterization thereof; an estimated classification of the scene; a characterization of the scene; and a sound heard from the environment of the scene.
 13. The medium of claim 8, wherein the first topic is one of a subject matter related to the program that the user previously signed up for; a subject matter dynamically determined; and a combination thereof.
 14. The medium of claim 8, wherein the new dialogue is initiated via at least one of speech and visual means.
 15. A system for initiating a dialogue with a user, comprising: a multimodal data analysis unit implemented on a processor and configured for receiving, by a dialogue agent via the communication platform, information capturing a user and a surrounding scene, wherein the user is to be engaged in a new dialogue in the scene, and extracting features from the information; a user state estimator implemented on a processor and configured for estimating a state of the user based on the features extracted from the information; a dialogue controller implemented on a processor and configured for selecting, from a plurality of programs, a program on a first topic for the user, wherein the program on the first topic corresponds to an intended dialogue, determining whether the intended dialogue on the first topic associated with the first program is to be initiated given the state of the user, initiating the intended dialogue with the user as the new dialogue, if the intended dialogue is appropriate according to the state of the user, and initiating, if the intended dialogue is currently inappropriate for the user, an alternative dialogue on an alternative topic with the user as the new dialogue, wherein the alternative topic is determined based on the state of the user, observed features of the user, and a dialogue context of the scene determined based on the relevant features.
 16. The system of claim 15, wherein when the state of the user indicates that the user is likely able to engage in the intended dialogue on the first topic, the intended dialogue is considered appropriate in the state of the user; and when the state of the user indicates that the user likely is not able to engage in the intended dialogue on the first topic, the intended dialogue is considered not appropriate in the state of the user.
 17. The system of claim 15, wherein the information includes multimodal sensor data in at least one of audio, visual, textual, and haptic modalities, where the audio sensor data record acoustic sound from the scene and/or a speech from the user.
 18. The system of claim 15, wherein the state of the user characterizes at least one of: an appearance of the user observed in the scene; an expression of the user estimated based on the information acquired from the scene; one or more emotions of the user inferred based on the expression of the user; and an intent of the user inferred based on at least one of the expression and the one or more emotions.
 19. The system of claim 15, wherein the dialogue context includes at least one of: at least one object present in the scene and a characterization thereof; an estimated classification of the scene; a characterization of the scene; and a sound heard from the environment of the scene.
 20. The system of claim 15, wherein the first topic is one of a subject matter related to the program that the user previously signed up for; a subject matter dynamically determined; and a combination thereof.
 21. The system of claim 15, wherein the new dialogue is initiated via at least one of speech and visual means.